Regex::RxParser
class.
In addition, if you need further info: "use the source - Luke".
These pages will not teach you regular expression usage nor the Smalltalk language.
For regular expressions, the following excellent book is recommended:
aString matchesRegex: '16r[[:xdigit:]]+'
(Coding the same ``the hard way'' is an exercise to a curious reader).
This matcher is offered to the Smalltalk community in hope it will be useful. It is free in terms of money, and to a large extent -- in terms of rights of use. Refer to `Boring Stuff' section for legalese.
String » matchesRegex:
method offers.
Happy hacking,
Vassili Bykov
<vassili@objectpeople.com>
<vassili@magma.ca>
August 6, 1996 (first release)
April 4, 1999 (rel1.1)
\w (same as [a-zA-Z0-9_])
\W (same as [^a-zA-Z0-9_])
\d (same as [0-9])
\D
\s
\S
\b
\B
\<
\>
'\w+'
is now a valid expression matching any word.
\w, \W, \d, \D, \s, and \S.
[:alnum:]
[:alpha:]
[:blank:]
[:cntrl:]
[:digit:]
[:graph:]
[:lower:]
[:print:]
[:punct:]
[:space:]
[:upper:]
[:xdigit:]
For example, the following patterns are equivalent:
'[[:alnum:]]+'
'\w+'
'[\w]+'
'[a-zA-Z0-9_]+'
\t tab (Character tab)
\n newline (Character lf)
\r carriage return (Character cr)
\f form feed (Character newPage)
\e escape (Character esc)
#asRegexIgnoringCase
#matchesRegexIgnoringCase:
#prefixMatchesRegexIgnoringCase:
matchesIn: aString
matchesIn: aString collect: aBlock
matchesIn: aString do: aBlock
matchesOnStream: aStream
matchesOnStream: aStream collect: aBlock
matchesOnStream: aStream do: aBlock
copy: aString translatingMatchesUsing: aBlock
copy: aString replacingMatchesWith: replacementString
copyStream: aStream to: writeStream translatingMatchesUsing: aBlock
copyStream: aStream to: writeStream replacingMatchesWith: aString
Examples:
'\w+' asRegex matchesIn: 'now is the time'
returns an OrderedCollection containing four strings: 'now', 'is', 'the', and 'time'.
returns 'now is THE TIME' (the regular expression matches words beginning with either an uppercase or a lowercase T).
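The expression that produces the second result is not shown above. A minimal sketch that is consistent with the description, assuming a case-insensitive matcher, the translation message listed earlier, and an input string chosen here for illustration, might look like this:

```smalltalk
"Uppercase every word that starts with t or T.
 The input string 'now is the Time' is an assumption made for this sketch."
'\<t\w+' asRegexIgnoringCase
    copy: 'now is the Time'
    translatingMatchesUsing: [:match | match asUppercase]
```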
'a' matchesRegex: 'a' "-> true"
'foobar' matchesRegex: 'foobar' "-> true"
The above paragraph introduced a primitive regular expression (a character),
and an operator (sequencing).
Operators are applied to regular expressions to produce more complex regular expressions.
Sequencing (placing expressions one after another) as an operator is,
in a certain sense, `invisible'--yet it is arguably the most common.
'blorple' matchesRegex: 'foobar' "-> false"
'abc' matchesRegex: 'a..' "-> true"
'abcd' matchesRegex: 'a..' "-> false"
More precisely, `a..' matches any 3-character string that starts with `a',
except those which include a newline character.
'ab' matchesRegex: 'a*b' "-> true"
'aaaaab' matchesRegex: 'a*b' "-> true"
'b' matchesRegex: 'a*b' "-> true"
'aac' matchesRegex: 'a*b' "-> false: b does not match"
'123aa' matchesRegex: '.*aa' "-> true (matches any string which ends with 'aa', but not containing a newline)"
'123aa456' matchesRegex: '.*aa.*' "-> true (matches any string containing 'aa', but not containing a newline)"
A star's precedence is higher than that of sequencing.
A star applies to the shortest possible subexpression that precedes it.
For example, 'ab*' means `a followed by zero or more occurrences of b',
not `zero or more occurrences of ab':
'abbb' matchesRegex: 'ab*' "-> true"
'abab' matchesRegex: 'ab*' "-> false"
'abab' matchesRegex: '(ab)*' "-> true"
'abcab' matchesRegex: '(ab)*' "-> false: c spoils the fun"
'ac' matchesRegex: 'ab*c' "-> true"
'ac' matchesRegex: 'ab+c' "-> false: need at least one b"
'abbc' matchesRegex: 'ab+c' "-> true"
'abbc' matchesRegex: 'ab?c' "-> false: too many b's"
'ac' matchesRegex: 'ab?c' "-> true: the b is optional"
'ab*' matchesRegex: 'ab*' "-> false: star in the right string is special"
'ab*' matchesRegex: 'ab\*' "-> true"
'a\c' matchesRegex: 'a\\c' "-> true"
`ab*|ba*'
means `a followed by any number of b's, or b followed by any number of a's':
'abb' matchesRegex: 'ab*|ba*' "-> true"
'baa' matchesRegex: 'ab*|ba*' "-> true"
'baab' matchesRegex: 'ab*|ba*' "-> false"
A bit more complex example is the following expression,
matching the name of any of the Lisp-style `car', `cdr', `caar', `cadr', ... functions:
c(a|d)+r
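As a quick check of the Lisp-accessor expression, the following workspace sketch applies it to a few sample names. The sample strings are chosen here for illustration only.

```smalltalk
'car'   matchesRegex: 'c(a|d)+r'.   "-> true"
'cdr'   matchesRegex: 'c(a|d)+r'.   "-> true"
'caddr' matchesRegex: 'c(a|d)+r'.   "-> true"
'cr'    matchesRegex: 'c(a|d)+r'.   "-> false: at least one a or d is required"
'cxr'   matchesRegex: 'c(a|d)+r'.   "-> false: x is neither a nor d"
```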
It is possible to write an expression matching an empty string, for example: `a|'.
However, it is an error to apply `*', `+', or `?' to such an expression: `(a|)*' is an invalid expression.
A character set is a string of characters enclosed in square brackets.
It matches any single character that appears between the brackets.
For example, `[01]' matches either `0' or `1':
'0' matchesRegex: '[01]' "-> true"
'3' matchesRegex: '[01]' "-> false"
'11' matchesRegex: '[01]' "-> false: a set matches only one character"
Using the plus operator, we can build the following binary number recognizer:
'10010100' matchesRegex: '[01]+' "-> true"
'10001210' matchesRegex: '[01]+' "-> false"
'0' matchesRegex: '[^01]' "-> false"
'3' matchesRegex: '[^01]' "-> true"
Special characters within a set are `^', `-', and `]' that closes the set.
Below are the examples of how to literally use them in a set:
[01^] -- put the caret anywhere except the beginning
[01-] -- put the dash as the last character
[]01] -- put the closing bracket as the first character
[^]01] (thus, empty and universal sets cannot be specified)
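The placement rules above can be tried out directly. This is a small sketch that follows those rules as stated; the sample strings are illustrative.

```smalltalk
"Literal caret, dash and closing bracket inside a set."
'^' matchesRegex: '[01^]'.    "-> true: caret is not at the beginning"
'-' matchesRegex: '[01-]'.    "-> true: dash is the last character"
']' matchesRegex: '[]01]'.    "-> true: closing bracket is the first character"

"Negated set that excludes the closing bracket, 0 and 1."
']' matchesRegex: '[^]01]'.   "-> false"
'2' matchesRegex: '[^]01]'.   "-> true"
```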
'1' matchesRegex: '[1.]' "-> true"
and:
'.' matchesRegex: '[1.]' "-> true"
but not:
'2' matchesRegex: '[1.]' "-> false"
\w any word constituent character (same as [a-zA-Z0-9_])
\W any character but a word constituent
\d a digit (same as [0-9])
\D anything but a digit
\s a whitespace character
\S anything but a whitespace character
These escapes are also allowed in character classes: '[\w+-]' means
'any character that is either a word constituent, or a plus, or a
minus'.
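A short sketch of the '[\w+-]' class described above, with sample strings chosen for illustration:

```smalltalk
'a' matchesRegex: '[\w+-]'.        "-> true: word constituent"
'+' matchesRegex: '[\w+-]'.        "-> true: literal plus"
'-' matchesRegex: '[\w+-]'.        "-> true: dash is last in the set, so it is literal"
'*' matchesRegex: '[\w+-]'.        "-> false"
'x+y-z' matchesRegex: '[\w+-]+'.   "-> true: every character is in the set"
```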
Character classes can also include the following grep(1)-compatible
elements:
[:alnum:] any alphanumeric, i.e., a word constituent, character
[:alpha:] any alphabetic character
[:blank:] space or tab.
[:cntrl:] any control character.
In this version, it means any character whose ASCII code is < 32.
[:digit:] any decimal digit.
[:graph:] any graphical character.
In this version, this means any character with ASCII code >= 32.
[:lower:] any lowercase character
[:print:] any printable character.
In this version, this is the same as [:cntrl:]
[:punct:] any punctuation character.
[:space:] any whitespace character.
[:upper:] any uppercase character.
[:xdigit:] any hexadecimal character.
Note that these elements are components of the character classes,
i.e. they have to be enclosed in an extra set of square brackets to
form a valid regular expression.
For example, a non-empty string of digits would be represented as '[[:digit:]]+'.
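To illustrate the extra pair of brackets, here is a small sketch using two of the grep(1)-style elements. The sample strings are illustrative.

```smalltalk
"The element goes inside a character set, hence the double brackets."
'2024'  matchesRegex: '[[:digit:]]+'.         "-> true"
'20x4'  matchesRegex: '[[:digit:]]+'.         "-> false: x is not a digit"
'16rFF' matchesRegex: '16r[[:xdigit:]]+'.     "-> true"
'16rXY' matchesRegex: '16r[[:xdigit:]]+'.     "-> false: X and Y are not hex digits"
```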
A sequence of characters between colons is treated as a unary selector
which is supposed to be understood by characters. A character matches
such an expression if it answers true to a message with that
selector. This allows a more readable and efficient way of specifying
character classes (by adding appropriate protocol to the character class,
it can also be easily extended).
For example, `[0-9]' is equivalent to `:isDigit:',
but the latter is more efficient. Analogously to character sets,
character classes can be negated: `:^isDigit:' matches a Character
that answers false to #isDigit, and is therefore equivalent to `[^0-9]'.
The following messages from Smalltalk's Character protocol are useful here:
:isControlCharacter: true if I am a control character (i.e. ascii value < 32 or == 16rFF)
:isDigit: as described above
:isLetter: a-z or A-Z
:isLetterOrDigit: a-z or A-Z or 0-9
:isNationalLetter: any letter in the whole Unicode set (not just a-z, A-Z)
:isNationalAlphaNumeric: any letter or digit from the Unicode set
:isLowercase: any lowercase letter in the Unicode set (i.e. not only a-z)
:isUppercase: any uppercase letter in the Unicode set (i.e. not only A-Z)
:isSeparator: any whitespace (space, nl, cr, tab, ff)
:isVowel: aeiouAEIOU
:isHexDigit: 0-9, a-f, A-F
As a summarizing example, so far we have seen the following equivalent ways to
write a regular expression that matches a non-empty string of digits:
'[0-9]+'
'\d+'
'[\d]+'
'[[:digit:]]+'
':isDigit:+'
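The selector-based classes, including the negated form, can be exercised as follows. This is only a sketch; the `:selector:` syntax is specific to this matcher, and the sample strings are illustrative.

```smalltalk
'12345' matchesRegex: ':isDigit:+'.     "-> true: every character answers true to #isDigit"
'12a45' matchesRegex: ':isDigit:+'.     "-> false: a is not a digit"
'abc'   matchesRegex: ':^isDigit:+'.    "-> true: no character is a digit"
'ab1'   matchesRegex: ':^isDigit:+'.    "-> false: 1 is a digit"
'1F'    matchesRegex: ':isHexDigit:+'.  "-> true"
```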
. matching any character except a newline;
^ matching an empty string at the beginning of a line;
$ matching an empty string at the end of a line.
\b an empty string at a word boundary
\B an empty string not at a word boundary
\< an empty string at the beginning of a word
\> an empty string at the end of a word
Again, all the above three characters (`.', `^' and `$')
are special and should be quoted to be matched literally.
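The anchors and word-boundary escapes are easiest to see with searching rather than whole-string matching. The sketch below assumes that `search:' (listed with the matcher protocol) answers a Boolean, and uses `matchesIn:' as shown in the earlier example; the sample strings are illustrative.

```smalltalk
"Words that begin with t."
'\<t\w+' asRegex matchesIn: 'now is the time'.   "-> a collection with 'the' and 'time'"

"Whole word 'is' only."
'\bis\b' asRegex search: 'now is the time'.      "-> true"
'\bis\b' asRegex search: 'this'.                 "-> false: no word boundary before 'is'"

"Line anchors."
'^now' asRegex search: 'now is the time'.        "-> true"
'now$' asRegex search: 'now is the time'.        "-> false: 'now' is not at the end"
```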
Examples:
'axyzb' matchesRegex: 'a.+b' "-> true"
'ax zb' matchesRegex: 'a.+b' "-> true (space is matched by `.')"
('ax' , Character cr ,'zb')
matchesRegex: 'a.+b' "-> false (newline is not matched by `.')"
('ax' , Character cr ,'zb')
matchesRegex: 'a(.|\n)+b' "-> true"
Checking if aString may represent a nonnegative integer number:
aString matchesRegex: '\d+'
or
aString matchesRegex: ':isDigit:+'
or
aString matchesRegex: '[0-9]+'
Checking if aString may represent an integer number with an optional
sign in front:
aString matchesRegex: '(\+|-)?\d+'
Checking if aString is a fixed-point number, with at least one digit
required after a dot:
aString matchesRegex: '(\+|-)?\d+(\.\d+)?'
The same, but allow notation like `123.':
aString matchesRegex: '(\+|-)?\d+(\.\d*)?'
Recognizer for a string that might be a name: one word with first
capital letter, no blanks, no digits. More traditional:
aString matchesRegex: '[A-Z][A-Za-z]*'
more Smalltalkish:
aString matchesRegex: ':isUppercase::isAlphabetic:*'
A date in format MMM DD, YYYY with any number of spaces in between, in
XX century:
aString matchesRegex: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)'
Note parentheses around some components of the expression above. As
`Usage' section shows, they will allow us to obtain the actual strings
that have matched them (i.e. month name, day number, and year number).
For dessert, coming back to numbers: here is a recognizer for a
general number format: anything like 999, or 999.999, or -999.999e+21.
aString matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
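A brief sketch that runs the general number recognizer against the formats mentioned above, plus a few strings it should reject. The variable name `numberRegex' is introduced here for illustration.

```smalltalk
| numberRegex |
numberRegex := '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'.

'999'          matchesRegex: numberRegex.   "-> true"
'999.999'      matchesRegex: numberRegex.   "-> true"
'-999.999e+21' matchesRegex: numberRegex.   "-> true"
'1e'           matchesRegex: numberRegex.   "-> false: the exponent needs digits"
'.5'           matchesRegex: numberRegex.   "-> false: a leading digit is required"
'abc'          matchesRegex: numberRegex.   "-> false"
```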
#matchesRegex:
-message to the sample string,
with a regular expression string as the argument.
aString prefixMatchesRegex: regexString
aString matchesRegexIgnoringCase: regexString
aString prefixMatchesRegexIgnoringCase: regexString
#prefixMatchesRegex: is just like #matchesRegex:, except that the whole
receiver is not expected to match the regular expression passed as the
argument; matching just a prefix of it is enough.
'abcde' matchesRegex: '(a|b)+' "-> false"
'abcde' prefixMatchesRegex: '(a|b)+' "-> true"
The last two messages are case-insensitive versions of matching.
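The case-insensitive variants can be compared with their case-sensitive counterparts in a short sketch; the sample strings are illustrative.

```smalltalk
'ABCDE' matchesRegex: 'abcde'.                         "-> false: case differs"
'ABCDE' matchesRegexIgnoringCase: 'abcde'.             "-> true"
'ABCDE' prefixMatchesRegex: '(a|b)+'.                  "-> false"
'ABCDE' prefixMatchesRegexIgnoringCase: '(a|b)+'.      "-> true: the prefix AB matches"
```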
aString regex: regexString matchesDo: aBlock
Evaluates a one-argument <aBlock> for every match of the regular
expression within the receiver string.
aString regex: regexString matchesCollect: aBlock
Evaluates a one-argument <aBlock> for every match of the regular
expression within the receiver string.
Collects results of evaluations and answers them as a SequenceableCollection.
aString allRegexMatches: regexString
Returns a collection of all matches (substrings of the receiver string) of the regular expression.
It is equivalent to:
aString regex: regexString matchesCollect: [:each | each].
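The three enumeration messages can be sketched side by side. The variable names and the sample string are introduced here for illustration.

```smalltalk
| words lengths count |

"All matches as strings."
words := 'now is the time' allRegexMatches: '\w+'.
"words holds 'now', 'is', 'the', 'time'"

"One block result per match."
lengths := 'now is the time' regex: '\w+' matchesCollect: [:each | each size].
"lengths holds 3, 2, 3, 4"

"Side effects only: count the matches."
count := 0.
'now is the time' regex: '\w+' matchesDo: [:each | count := count + 1].
"count is now 4"
```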
aString copyWithRegex: regexString matchesReplacedWith: aString
For example:
'ab cd ab' copyWithRegex: '(a|b)+' matchesReplacedWith: 'foo'
returns the string: 'foo cd foo'.
A more general substitution is match translation:
aString copyWithRegex: regexString matchesTranslatedUsing: aBlock
This message evaluates a block passing it each match of the regular
expression in the receiver string and answers a copy of the receiver
with the block results spliced into it in place of the respective
matches.
For example:
'ab cd ab' copyWithRegex: '(a|b)+' matchesTranslatedUsing: [:each | each asUppercase]
results in the string: 'AB cd AB'.
All messages of enumeration and replacement protocols perform a case-sensitive match. Case-insensitive versions are not provided as part of a CharacterArray protocol. Instead, they are accessible using the lower-level matching interface.
aString matchesRegex:
works as follows:
RxParser
is created,
and the regular expression string is passed to it, yielding the expression's syntax tree.
RxMatcher
. The instance sets up some data structure that
will work as a recognizer for the regular expression described by the tree.
You can create a matcher using one of the following methods:
forString:ignoreCase:
message to RxMatcher class
,
with the regular expression string and a Boolean indicating whether case is ignored as arguments.
forString:
message.
... forString: regexString ignoreCase: false
".
"regexString asRegex" is equivalent to "RxMatcher forString: regexString".
"regexString asRegexIgnoringCase" is equivalent to "RxMatcher forString: regexString ignoreCase: true".
hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+'
hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+' ignoreCase: false
hexRecognizer := '16r[0-9A-Fa-f]+' asRegex
hexRecognizer := '16r[0-9A-F]+' asRegexIgnoringCase
matches: aString
matchesPrefix: aString
search: aString
matchesStream: aStream
matchesStreamPrefix: aStream
searchStream: aStream
lastResult
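Creating a matcher once and reusing it avoids parsing the expression on every call. The sketch below reuses the hex recognizer from above with the string-based matching messages; it assumes each of them answers a Boolean, and the sample strings are illustrative.

```smalltalk
| hexRecognizer |
hexRecognizer := '16r[0-9A-Fa-f]+' asRegex.

hexRecognizer matches: '16rFF'.                 "-> true: the whole string matches"
hexRecognizer matches: '16rXYZ'.                "-> false"
hexRecognizer matches: 'value is 16r1F'.        "-> false: whole string must match"
hexRecognizer matchesPrefix: '16r1F and more'.  "-> true: a prefix is enough"
hexRecognizer search: 'value is 16r1F'.         "-> true: a match anywhere is enough"
```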
`((ab)+(c|d))?ef'
includes the following subexpressions with these indices:
1: ((ab)+(c|d))?ef
2: (ab)+(c|d)
3: ab
4: c|d
Be aware that the first subexpression represents the whole match.
subexpressionCount
subexpression: anIndex
subBeginning: anIndex
subEnd: anIndex
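As a sketch of the indexing scheme listed above, the following applies the sample expression to a string in which every group takes part in the match exactly once. The input string is chosen here for illustration; results for groups that do not participate in a match are not shown because the text above does not specify them.

```smalltalk
| matcher |
matcher := '((ab)+(c|d))?ef' asRegex.
(matcher matches: 'abcef')
    ifTrue: [
        Transcript showCR: (matcher subexpression: 1).   "whole match: abcef"
        Transcript showCR: (matcher subexpression: 2).   "(ab)+(c|d): abc"
        Transcript showCR: (matcher subexpression: 3).   "ab"
        Transcript showCR: (matcher subexpression: 4).   "c|d: c"
    ]
```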
| matcher |
matcher := Regex::RxMatcher new initializeFromString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*19(:isDigit::isDigit:)'.
(matcher matches: 'Aug 6, 1996')
ifTrue:
[Array
with: (matcher subexpression: 4)
with: (matcher subexpression: 2)
with: (matcher subexpression: 3)]
ifFalse: ['no match']
(should answer `#('96' 'Aug' '6')').
matchesIn: aString
matchesIn: aString do: aBlock
matchesIn: aString collect: aBlock
copy: aString replacingMatchesWith: replacementString
copy: aString translatingMatchesUsing: aBlock
matchesOnStream: aStream
matchesOnStream: aStream do: aBlock
matchesOnStream: aStream collect: aBlock
copy: sourceStream to: targetStream replacingMatchesWith: replacementString
copy: sourceStream to: targetStream translatingMatchesWith: aBlock
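The string-based messages of this protocol can be sketched as follows, including a case-insensitive enumeration through a matcher created with `asRegexIgnoringCase'. The sample strings are illustrative, and the stream-based variants are not shown.

```smalltalk
"Enumeration through a matcher."
'\w+' asRegex matchesIn: 'now is the time' collect: [:w | w size].
"-> a collection with 3, 2, 3, 4"

"Replacement through a matcher."
'(a|b)+' asRegex copy: 'ab cd ab' replacingMatchesWith: 'foo'.
"-> 'foo cd foo'"

"Case-insensitive enumeration."
't\w+' asRegexIgnoringCase matchesIn: 'The time'.
"-> a collection with 'The' and 'time'"
```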
RxParser class
protocol.
To handle possible errors, use the protocol described below to obtain the exception objects
and use the protocol of the native Smalltalk implementation to handle them.
If a syntax error is detected while parsing an expression,
RxParser » syntaxErrorSignal is raised/signaled.
If an error is detected while building a matcher,
RxParser » compilationErrorSignal is raised/signaled.
If an error is detected while matching
(for example, if a bad selector was specified using `:<selector>:' syntax,
or because of the matcher's internal error),
RxParser » matchErrorSignal is raised.
RxParser » regexErrorSignal is the parent of all three.
Since any of the three signals can be raised within a call to #matchesRegex:,
it is handy if you want to catch them all.
For example:
Ansi-Smalltalk (VisualWorks, SmalltalkX, Squeak etc.):
[ 'abc' matchesRegex: '))garbage[' ]
on: RxParser regexErrorSignal
do: [:ex | ex returnWith: nil]
VisualWorks, SmalltalkX:
RxParser regexErrorSignal
handle: [:ex | ex returnWith: nil]
do: [ 'abc' matchesRegex: '))garbage[' ]
VisualAge, SmalltalkX:
[ 'abc' matchesRegex: '))garbage[' ]
when: RxParser regexErrorSignal
do: [:signal | signal exitWith: nil]
VB-Regex-Syntax
VB-Regex-Matcher
and a few CharacterArray
methods in `VB-regex'
protocol.
No system classes or methods are modified.
String » matchesRegex:
RxParser
Rxs<whatever>
classes.
RxMatcher
RxParser
, RxMatcher
, or both.
The matcher passes H. Spencer's test suite (see 'test suite' protocol), with quite a few extra tests added, so chances are good there are not too many bugs. But watch out anyway.
a. any modified version is expressly marked as such and is not
misrepresented as the original software;
b. credit is given to the original software in the source code and
documentation of the derived work;
c. the copyright notice at the top of this document accompanies
copyright notices of any modified version.
Felix Hack Eliot Miranda Robb Shecter David N. Smith Francis Wolinski
and anyone whom I haven't yet met or heard from, but who agrees this has not been a complete waste of time.
'hello world' matchesRegex: 'h.*d'
Or:
|matcher|
matcher := '.*ll.*' asRegex.
matcher matches: 'hello world'.
Fetching matched subexpressions:
|matcher sub1 sub2 sub3|
matcher := '\D*([0-9]+)\s([0-9]+)\D*.*' asRegex.
(matcher matches: 'bla bla 123456 123 bla bla') ifTrue:[
Transcript showCR:(matcher subexpressionCount printString , ' subExpressions').
sub1 := matcher subexpression:1.
sub2 := matcher subexpression:2.
sub3 := matcher subexpression:3.
Transcript showCR:'subExpr1 is ' , sub1.
Transcript showCR:'subExpr2 is ' , sub2.
Transcript showCR:'subExpr3 is ' , sub3.
].
Legally, it is a freeware or public domain goody, as specified in the goodies copyright notice (see the goodies source).
Vassili Bykov
See RxParser class » boringStuff
for legal information.
Copyright © 1999 eXept Software AG
<info@exept.de>